Focus of this tutorial

- Phenomena 

- Concepts 

- Approaches & Resources 

What is ‘Arabic’? 

- Arabic Script 

- Arabic Language 

• Modern Standard Arabic (MSA)

• Arabic Dialects 




Road Map 



• Introduction 

• Orthography 

• Morphology 

• Syntax 

• Machine Translation Issues 

• Dialects 
Road Map



Introduction 

Orthography 

- Arabic Script 

- MSA Phonology and Spelling 

- Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/... 

- Encoding Issues 

Morphology 

Syntax 

Machine Translation Issues 
Dialects 



Arabic Script 



[Figure: descent of the script from Phoenician through Early Aramaic and Nabataean to Arabic, shown alongside the line from Phoenician through Greek and Early Latin to Modern Roman. © Mamoun Sakkal 1997]
Arabic Script

Arabic script is an alphabet with allographic variants, optional zero-width diacritics, and common ligatures.

Arabic script is used to write many languages: Arabic, Persian, Kurdish, Urdu, Pashto, etc.




Arabic Script 



Alphabet

• letter forms

• letter marks

• Arabic only

• Other languages

• Persian, Kurdish, Urdu, Pashto, etc.

• OCR output ambiguity

[Glyph examples for each item were shown as images]






Arabic Script 



Alphabet (MSA)

• letters (form+mark)

• Distinctive marks

  ث /θ/   ت /t/   ب /b/
  ش /ʃ/   س /s/

• Non-distinctive marks

  ء أ إ ؤ ئ   /ʔ/   glottal stop, aka hamza
Arabic Script

Letter Shapes

• No distinction between print and handwriting

• No capitalization

• Right-to-left

• Ambiguous shapes

• Connective letters

• Disconnective letters

[Table: sample letters in their stand-alone, initial, medial, and final shapes]



Arabic Script 



Letter shaping

• ك + ت + ب = كتب   /katab/   to write

• ك + ت + ا + ب = كتاب   /kitāb/   book
Arabic Script

Diacritics

• Zero-width characters

• Used for short vowels

  كَتَب   /katab/   to write

• Nunation is used as the nominal indefinite marker in MSA

  كِتابٌ   /kitābun/   a book

  Vowel         Nunation
  بَ /ba/       بً /ban/
  بُ /bu/       بٌ /bun/
  بِ /bi/       بٍ /bin/
Arabic Script

Diacritics

• No-vowel marker (sukun)

  مَكْتَب   /maktab/   office

• Double consonant marker (shadda)

  كَتَّب   /kattab/   to dictate

• Combinable

  بُّ /bbu/   بٍّ /bbin/   بًّ /bban/
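Because the diacritics above are separate zero-width characters, they can be tested for and removed one character at a time. The following is a minimal sketch, assuming Unicode text; the name `strip_diacritics` is illustrative, and the range U+064B to U+0652 covers only the nunation, short-vowel, shadda, and sukun marks discussed on these slides.

```python
import re
import unicodedata

# U+064B..U+0652: fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun
DIACRITICS = re.compile("[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Remove the optional Arabic diacritics, leaving the letters untouched."""
    return DIACRITICS.sub("", text)

# /maktab/ 'office', fully diacritized: meem, fatha, kaf, sukun, teh, fatha, beh
maktab = "\u0645\u064E\u0643\u0652\u062A\u064E\u0628"

print(strip_diacritics(maktab))          # the four bare letters m-k-t-b
print(unicodedata.category("\u0651"))    # shadda is a nonspacing mark
```

Stripping diacritics this way produces the undiacritized spelling that most written text uses, at the cost of the ambiguity discussed later in the tutorial.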
Arabic Script

Putting it together

Simple combination

  Arab   /ʕarab/   عَرَب = ع + ر + ب

  West   /ɣarb/    غَرْب = غ + ر + ب

Ligatures

  Peace  /salām/   سلام   (ل + ا written as the ligature لا)



Arabic Script 



Tatweel

• 'elongation'

• aka kashida

• used for text highlighting and justification

  human rights   /ħuqūq alʔinsān/   حقوق الإنسان   حقـــوق الإنســـان
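Since tatweel only stretches a word visually, text processing usually drops it before comparing or counting tokens. A small sketch of that step, assuming Unicode input; the function name `strip_tatweel` is illustrative.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)

def strip_tatweel(text: str) -> str:
    """Drop elongation characters so stretched and plain spellings compare equal."""
    return text.replace(TATWEEL, "")

plain = "\u062D\u0642\u0648\u0642"                          # Huquq 'rights'
stretched = "\u062D\u0642\u0640\u0640\u0640\u0648\u0642"    # same word with three tatweels

print(strip_tatweel(stretched) == plain)
```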
Arabic Script

• Different styles

• High fluidity

• Optional ligatures

• Vertical arrangements

[Figure: calligraphic renderings of three words]

  Arabic     /ʕarabi/
  Muhammad   /muħammad/
  algebra    /alʤabr/
Arabic Script

"Arabic" Numerals

• Decimal system

• Numbers written left-to-right in right-to-left text

  [Arabic example sentence containing the numbers 1962 and 132]

  Algeria achieved its independence in 1962 after 132 years of French occupation.

• Three systems of enumeration symbols that vary by region

  Western Arabic (Tunisia, Morocco, etc.):        0 1 2 3 4 5 6 7 8 9
  Indo-Arabic (Middle East):                      ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
  Eastern Indo-Arabic (Iran, Pakistan, etc.):     ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹
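The three digit series are distinct Unicode characters, so a processing pipeline typically folds them into one series. A minimal sketch, assuming the Indo-Arabic digits at U+0660 to U+0669 and the Eastern Indo-Arabic digits at U+06F0 to U+06F9; the name `normalize_digits` is illustrative.

```python
# Map the two Arabic-script digit series onto ASCII digits.
INDO_ARABIC = "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669"
EASTERN = "\u06F0\u06F1\u06F2\u06F3\u06F4\u06F5\u06F6\u06F7\u06F8\u06F9"
TO_ASCII = str.maketrans(INDO_ARABIC + EASTERN, "0123456789" * 2)

def normalize_digits(text: str) -> str:
    """Replace Indo-Arabic and Eastern Indo-Arabic digits with 0-9."""
    return text.translate(TO_ASCII)

year = "\u0661\u0669\u0666\u0662"   # 1962 written with Indo-Arabic digits
print(normalize_digits(year))
print(int(year))                    # Python's int() also accepts Unicode decimal digits
```

The digits are stored most-significant first in both scripts, which is why no reordering is needed here.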






MSA Phonology and Spelling 



• Phonological profile of Standard Arabic 

- 28 Consonants 

- 3 short vowels, 3 long vowels, 2 diphthongs 

• Arabic spelling is mostly phonemic ... 

- Letter-sound correspondence 




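  ء /ʔ/   ب /b/   ت /t/   ث /θ/   ج /ʤ/   ح /ħ/   خ /x/   د /d/   ذ /ð/   ر /r/
  ز /z/   س /s/   ش /ʃ/   ص /sˤ/   ض /dˤ/   ط /tˤ/   ظ /ðˤ/   ع /ʕ/   غ /ɣ/
  ف /f/   ق /q/   ك /k/   ل /l/   م /m/   ن /n/   ه /h/   و /w/   ي /y/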
MSA Phonology and Spelling

• Arabic spelling is mostly phonemic ...

Except for

• Medial short vowels can only appear as diacritics

• Diacritics are optional in most written text

- Except in holy scripture

- When present, diacritics mark syntactic/semantic distinctions

• كتب: /katab/ to write, /kutib/ to be written

• حب: /ħubb/ love, /ħabb/ seed

• Dual use of ا, و, ي as consonant and long vowel

- ا (/ʔ/, /ā/)   و (/w/, /ū/)   ي (/y/, /ī/)




MSA Phonology and Spelling 



• Arabic spelling is mostly phonemic ... 

Except for (continued) 

• Morphophonemic characters

- Feminine marker ة (ta marbuta)

• /kabīr/ (big, masculine) -> /kabīra/ (big, feminine)

- Derivation marker

• /ʕaṣā/: عصى (to disobey) vs. عصا (a stick)

• Hamza variants (6 characters for one phoneme!)

- ء آ أ ؤ إ ئ, e.g. /bahāʔ/ + 3MascSing (his glory): بهاءه, بهاؤه, بهائه
MSA Phonology and Spelling

• Arabic spelling can be ambiguous

- optional diacritics and dual use of letters

• But how ambiguous? Really?

• Classic example

ths s wht n rbc txt lks lk wth n vwls

this is what an Arabic text looks like with no vowels

• Not exactly true

- Long vowels are always written

- Initial vowels are represented by an ا ('alef)

- Some final short vowels are represented

ths is wht an Arbc txt lks lik wth no vwls

We will revisit ambiguity in more detail under the morphology discussion.



Arabic Script 
Other languages 



Arabic 

• No more than 3 dots 

• Dots either above or below 

• Marks are 1 / 2 / 3 dots, hamza (ء) or madda (~) only

• Rare borrowing for foreign words

• پ /p/, ڤ /v/, گ /g/, چ /tʃ/

• regionally variable

Not Arabic

• Extra marks: haček (v), ring (o), small ṭāʾ (ط), four dots, vertical dots (:)

• Some Numerals

Once you learn the alphabet, it is easier.
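The checklist above can be approximated by looking for letters that MSA does not use. The sketch below is a heuristic only: the letter set is a small illustrative sample rather than a full inventory, and the names `NON_MSA_LETTERS`, `looks_non_arabic`, and the `threshold` parameter are assumptions. The threshold exists because Arabic text occasionally borrows letters such as peh or gaf for foreign words.

```python
# Letters used by Persian, Urdu, Pashto, Kurdish, etc. that MSA does not use.
NON_MSA_LETTERS = {
    "\u067E",  # peh
    "\u0686",  # tcheh
    "\u0698",  # jeh
    "\u06AF",  # gaf
    "\u0679",  # tteh (teh with small tah above)
    "\u0688",  # ddal
    "\u0691",  # rreh
    "\u06BA",  # noon ghunna
    "\u06D2",  # yeh barree
}

def looks_non_arabic(text: str, threshold: int = 2) -> bool:
    """Return True when the text contains at least `threshold` non-MSA letters."""
    hits = sum(ch in NON_MSA_LETTERS for ch in text)
    return hits >= threshold

persian_sample = "\u067E\u0627\u0631\u0633\u06CC \u06AF\u0641\u062A"  # contains peh and gaf
arabic_sample = "\u0643\u062A\u0627\u0628"                            # kitAb 'book'

print(looks_non_arabic(persian_sample), looks_non_arabic(arabic_sample))
```

A production language identifier would combine such character evidence with word-level statistics; this only illustrates the script-level cue.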
[Image: text sample 1]

□ Arabic

□ Not Arabic

[Image: text sample 2]

□ Arabic

□ Not Arabic

[Image: text sample 3]

□ Arabic

□ Not Arabic








Encoding Issues 



Encoding Arabic 

- Data entry, storage, and display 

- Ease of use for Arabic-illiterate users 

- Multi-script support 

- Multilingual support (extended Arabic characters) 

Types of Encoding 

- Machine character sets 

• Graphemic (shape insensitive, logical order) 

• Allographs (shape/direction sensitive) [obsolete] 

- Human accessible 

• Transliteration 

• Phonetic spelling (IPA) 

• Romanization 




Encoding Issues 

• Many Conflicting Character Sets for Arabic 






Encodings 



• CP-1256

- Commonly used

- 1-byte characters

- Widely supported input/display

- Minimal support for extended Arabic characters

- Bi-script support (Roman/Arabic)

- Tri-lingual support: Arabic, French, English (à la ANSI)



Codepage 1256 - Arabic Windows

[Chart: 16 x 16 grid giving, for each byte value 00-FF, its glyph and Unicode code point]
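The one-byte-per-character property of CP-1256 can be checked directly. This sketch uses Python's built-in `cp1256` and `utf-8` codecs; the sample word is an assumption chosen for illustration.

```python
salam = "\u0633\u0644\u0627\u0645"   # seen, lam, alef, meem: 'peace'

cp1256_bytes = salam.encode("cp1256")
utf8_bytes = salam.encode("utf-8")

print(cp1256_bytes.hex())                        # one byte per letter
print(len(cp1256_bytes), len(utf8_bytes))        # CP-1256 is half the size of UTF-8 here
print(cp1256_bytes.decode("cp1256") == salam)    # round trip is lossless
```

The byte values agree with the chart: D3 maps to U+0633, E1 to U+0644, C7 to U+0627, and E3 to U+0645.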



Encodings 



• Unicode

- Increasingly the standard

- 2-byte characters

- Widely supported input/display

- Supports extended Arabic characters

- Multi-script representation
Encodings

• Unicode

- Supports presentation forms (shapes and ligatures)

[Chart: Arabic Presentation Forms-B, FE70 to FEFF]

[Chart: Arabic Presentation Forms-A (excerpt), FC40 to FD1F]
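Text that was stored with presentation forms (shape-specific code points) can be folded back to the graphemic letters with Unicode compatibility normalization. A minimal sketch using the standard `unicodedata` module; the function name `to_graphemic` is illustrative.

```python
import unicodedata

def to_graphemic(text: str) -> str:
    """Map presentation forms and ligatures back to base letters via NFKC."""
    return unicodedata.normalize("NFKC", text)

beh_initial = "\uFE91"         # ARABIC LETTER BEH INITIAL FORM
lam_alef_isolated = "\uFEFB"   # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

print([hex(ord(c)) for c in to_graphemic(beh_initial)])        # the plain letter beh
print([hex(ord(c)) for c in to_graphemic(lam_alef_isolated)])  # lam followed by alef
```

NFKC also rewrites other compatibility characters (for example full-width Latin letters), so it may be broader than needed for some applications.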
Encoding Issues

Arabic Display

• Memory (logical order) ->

[Arabic example sentence in logical order, containing (Palestine), (Olympics), 2000 and 2004, shown once as mis-decoded Western characters and once in unshaped Arabic letters]

or this way for those with direction-bias

[The same two renderings with every character listed in reverse order: .4002 ... 0002 ) scipmylO ( ... ) enitselaP ( ...]
Encoding Issues

Arabic Display

• Memory (logical order)

[The same Arabic example sentence in logical order]

• Display (visual order)

- Bidirectional (BiDi) support

• Numbers and Roman script

[The sentence laid out right-to-left, with (Palestine), (Olympics), 2000 and 2004 kept left-to-right and the Arabic letters still unshaped]

- Letter and ligature shaping

[The same layout with the Arabic letters joined and ligatures formed]
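The reordering from logical to visual order is driven by a directional class that Unicode assigns to every character. The sketch below only inspects those classes with the standard `unicodedata` module; it does not implement the bidirectional algorithm itself, and the helper name `bidi_classes` is an assumption.

```python
import unicodedata

def bidi_classes(text: str) -> list:
    """Return the Unicode bidirectional class of each character in logical order."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Logical (memory) order: an Arabic letter, a Latin letter, an ASCII digit,
# an Indo-Arabic digit, and an opening parenthesis.
sample = "\u0641" + "O" + "2" + "\u0662" + "("
print(bidi_classes(sample))
```

Arabic letters carry class AL (right-to-left), Latin letters L, ASCII digits EN, Indo-Arabic digits AN, and parentheses ON (neutral); a display engine resolves the neutral and numeric classes from their context.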
Display Problems

[Table: the same Arabic text stored in one encoding (rows, Actual Encoding: CP-1256, ISO-8859, Unicode) and displayed as another (columns, Display Encoding: CP-1256, ISO-8859, Unicode, Western). Mismatched combinations show garbled characters.]

• Wrong encoding

• Partial support problems
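The wrong-encoding case is easy to reproduce: bytes written in one code page are decoded with another. The sketch below stores the Arabic word for Palestine in CP-1256 and displays it as Western (ISO-8859-1); the variable names are illustrative.

```python
text = "\u0641\u0644\u0633\u0637\u064A\u0646"   # 'Palestine' in Arabic script

stored = text.encode("cp1256")        # actual encoding: CP-1256
shown = stored.decode("latin-1")      # display encoding: Western (ISO-8859-1)
print(shown)                          # accented Latin letters instead of Arabic

# No information is lost, so reversing the two steps repairs the text.
repaired = shown.encode("latin-1").decode("cp1256")
print(repaired == text)
```

The repair works only because ISO-8859-1 maps every byte to a character; with a display encoding that rejects or replaces bytes, the damage may not be reversible.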
Encoding Issues

Arabic Input

Standard graphemic keyboard
Logical order input

[Image: Arabic keyboard layout]

http://www.cyrillic.com/kbd/btc.html























Encodings 

Buckwalter Encoding 

• Romanization

- One-to-one mapping to Arabic script spelling

- Left-to-right

- Easy to learn/use

- Human & machine compatible

• Commonly used in NLP

- Penn Arabic Tree Bank

• Some characters can be modified to allow use with XML and regular expressions

• Roman input/display

• Monolingual encoding (can’t do English and Arabic)

• Minimal support for extended Arabic characters
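Because the mapping is one-to-one, conversion in either direction is a plain character translation. The sketch below covers a subset of the Buckwalter table (the letters U+0621 to U+063A, then tatweel, the remaining letters, and the diacritics U+0640 to U+0652); the symbol string is written from the commonly published table and should be checked against Buckwalter's own documentation before serious use. The names `to_arabic` and `to_buckwalter` are illustrative.

```python
# Buckwalter symbols listed in Unicode code point order.
BW = "'|>&<}AbptvjHxd*rzs$SDTZEg" + "_fqklmnhwYyFNKaui~o"
AR = ("".join(chr(c) for c in range(0x0621, 0x063B))
      + "".join(chr(c) for c in range(0x0640, 0x0653)))
assert len(BW) == len(AR)   # the two alphabets must line up one-to-one

BW2AR = str.maketrans(BW, AR)
AR2BW = str.maketrans(AR, BW)

def to_arabic(bw: str) -> str:
    """Buckwalter romanization -> Arabic script."""
    return bw.translate(BW2AR)

def to_buckwalter(ar: str) -> str:
    """Arabic script -> Buckwalter romanization."""
    return ar.translate(AR2BW)

word = to_arabic("kitAb")             # 'book', with the kasra written
print([hex(ord(c)) for c in word])
print(to_buckwalter(word))
```

The monolingual limitation noted above is visible here: any Latin letter in the input is treated as Buckwalter and converted, so English and Arabic cannot share one string.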
Road Map



• Introduction 

• Orthography 

• Morphology 

- Derivational Morphology 

- Inflectional Morphology 

- Morphological Ambiguity 

- Arabic Computational Morphology 

• Syntax 

• Machine Translation Issues 

• Dialects 



Morphology 



• Type 

- Concatenative: prefix, suffix, circumfix

- Templatic: root+pattern

• Function 

- Derivational 

• Creating new words 

• Mostly templatic 

- Inflectional 

• Modifying features of words 

- Tense, number, person, mood, aspect 

• Mostly concatenative 
Derivational Morphology

• Templatic Morphology

  Root ك ت ب (k-t-b) + Pattern ma12ū3 -> Lexeme مكتوب maktūb, written

  Root ك ت ب (k-t-b) + Pattern 1ā2i3 -> Lexeme كاتب kātib, writer

Lexeme.Meaning = (Root.Meaning + Pattern.Meaning) * Idiosyncrasy.Random

Derivational Morphology 

Root Meaning 

• ك ت ب KTB = notion of “writing”

  كتب     /katab/     write
  كاتب    /kātib/     writer
  مكتوب   /maktūb/    written
  مكتوب   /maktūb/    letter
  كتاب    /kitāb/     book
  مكتب    /maktab/    office
  مكتبة   /maktaba/   library
Derivational Morphology

Root Meaning

• LHM-1

• Notion of “meat”

- لحم /laħm/

• Meat

- لحّام /laħħām/

• Butcher

Derivational Morphology

Root Meaning

• LHM-2

• Notion of “battle”

- ملحمة /malħama/

• Fierce battle

• Massacre

• Epic

Derivational Morphology

Root Meaning

• LHM-3

• Notion of “soldering”

- لحم /laħam/

• Weld, solder, stick, cling

- التحم /iltaħam/

• Be welded/soldered/fused

- ملتحم /multaħim/

• Welded, soldered, fused
Derivational Morphology

Pattern Meaning

• Verb Pattern Meaning is hard to define

  Form   Pattern     Pattern Meaning               Example             Gloss
  I      1a2a3       Basic sense of root           ktb -> katab        write
  II     1a22a3      Intensification, causation    ktb -> kattab       dictate
  III    1aA2a3      Interaction with others       ktb -> kaAtab       correspond with
  IV     Aa12a3      Causation                     jls -> Ajlas        seat
  V      ta1a22a3    Reflexive of Pattern II       Elm -> taEal~am     learn
  VI     ta1aA2a3    Reflexive of Pattern III      ktb -> takaAtab     correspond
  VII    Ain1a2a3    Passive of Pattern I          ktb -> Ainkatab     subscribe/enroll
  VIII   Ai1ta2a3    Acquiescence, exaggeration    ktb -> Aiktatab     register
  IX     Ai12a33     Transformation                Hmr -> AiHmarr      turn red/blush
  X      Aista12a3   Requirement                   ktb -> Aistaktab    ask/make_write
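The root+pattern notation in the table can be read as a simple substitution: the digits 1, 2, and 3 in a pattern are slots for the first, second, and third root consonants. A minimal sketch of that interdigitation, assuming triliteral roots and the romanized patterns shown above; the function name `apply_pattern` is illustrative, and real morphological systems add phonological adjustments that this ignores.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill the slots 1, 2, 3 of a pattern with the root's three radicals."""
    if len(root) != 3:
        raise ValueError("this sketch handles triliteral roots only")
    slots = {"1": root[0], "2": root[1], "3": root[2]}
    # Every character that is not a slot digit is copied unchanged.
    return "".join(slots.get(ch, ch) for ch in pattern)

for pattern in ("1a2a3", "1a22a3", "ta1aA2a3", "Aista12a3"):
    print(pattern, "->", apply_pattern("ktb", pattern))
```

Applying the Form IX pattern to a different root works the same way, for example `apply_pattern("Hmr", "Ai12a33")`.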
Inflectional Morphology

Derivational Morphology

- Lexeme ≈ Root + Pattern

Inflectional Morphology

- Word = Lexeme + Features

Features

- Part-of-speech

• Traditional: Noun, Verb, Particle

• Computational: N, PN, V, Adj, Adv, P, Pron, Num, Conj, Det, Aux, Pun, IJ, and others

- Noun-specific

• Number: singular, dual, plural, collective

• Gender: masculine, feminine, neutral

• Definiteness: definite, indefinite

• Case: nominative, accusative, genitive

• Possessive clitic
Inflectional Morphology



Features (continued) 

- Verb-specific 

• Aspect: perfective, imperfective, imperative 

• Voice: active, passive 

• Tense: past, present, future 

• Mood: indicative, subjunctive, jussive 

• Subject (Person, Number, Gender) 

• Object clitic 

- Others 

• Single-letter conjunctions 

• Single-letter prepositions 




Inflectional Morphology

Nouns

  وكبيوتنا
  /wakabiyutina/
  wa+ka+biyut+na
  and+like+houses+our
  And like our houses

  وللمكتبات
  /walilmaktabat/
  wa+li+al+maktaba+at
  and+for+the+library+plural
  And for the libraries

• Morphotactics (e.g. ل + ال -> لل)

• Arabic Broken Plurals (templatic)
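The morphotactic rule above (the preposition li absorbs the alef of the definite article) can be written as a small rewrite on Buckwalter-romanized spellings. This is a sketch of that single rule only; the function name `attach_li` is an assumption, and the test for the article is a plain prefix check.

```python
def attach_li(noun_bw: str) -> str:
    """Attach the preposition l+ to a noun given in Buckwalter spelling."""
    if noun_bw.startswith("All"):
        # l + Al + l... : the article's alef is dropped and only two lams are written
        return "l" + noun_bw[2:]
    if noun_bw.startswith("Al"):
        # l + Al... -> ll... : the article's alef is dropped
        return "l" + noun_bw[1:]
    return "l" + noun_bw

print(attach_li("AlmktbAt"))   # for the libraries
print(attach_li("Allgp"))      # for the language
print(attach_li("lgp"))        # for a language
```

The last two calls produce the same spelling, which is one source of the segmentation ambiguity that an analyzer has to resolve.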



Inflectional Morphology

Verbs

  [conj] [tense] [verb] [subj] [object]

  فقلناها
  /faqulnaha/
  fa+qul+na+ha
  so+said+we+it
  So we said it.

  وسنقولها
  /wasanaquluha/
  wa+sa+na+qul+u+ha
  and+will+we+say+it
  And we will say it

• Morphotactics

• Subject conjugation (suffix or circumfix)
Inflectional Morphology

• Perfect verb subject conjugation (suffixes only)

       Singular    Dual         Plural
  1    katabtu                  katabnā
  2    katabta     katabtumā    katabtum
  3    kataba      katabā       katabū

• Imperfect verb subject conjugation (prefix+suffix)

       Singular    Dual         Plural
  1    aktubu                   naktubu
  2    taktubu     taktubān     taktubūn
  3    yaktubu     yaktubān     yaktubūn

Feminine forms and other verb moods are not shown.
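The two paradigms differ in where the subject marking attaches: the perfect uses suffixes only, the imperfect a prefix plus a suffix. The sketch below encodes just the masculine indicative cells of the tables above; the dictionary and function names are illustrative, and the stems are passed in already vocalized.

```python
# (person, number) -> suffix, masculine forms only
PERFECT_SUFFIXES = {
    ("1", "sg"): "tu", ("1", "pl"): "nā",
    ("2", "sg"): "ta", ("2", "du"): "tumā", ("2", "pl"): "tum",
    ("3", "sg"): "a",  ("3", "du"): "ā",    ("3", "pl"): "ū",
}

# (person, number) -> (prefix, suffix), masculine indicative only
IMPERFECT_AFFIXES = {
    ("1", "sg"): ("a", "u"),  ("1", "pl"): ("na", "u"),
    ("2", "sg"): ("ta", "u"), ("2", "du"): ("ta", "ān"), ("2", "pl"): ("ta", "ūn"),
    ("3", "sg"): ("ya", "u"), ("3", "du"): ("ya", "ān"), ("3", "pl"): ("ya", "ūn"),
}

def perfect(stem: str, person: str, number: str) -> str:
    """Suffix-only conjugation, e.g. katab + tu."""
    return stem + PERFECT_SUFFIXES[(person, number)]

def imperfect(stem: str, person: str, number: str) -> str:
    """Circumfix conjugation, e.g. ya + ktub + ūn."""
    prefix, suffix = IMPERFECT_AFFIXES[(person, number)]
    return prefix + stem + suffix

print(perfect("katab", "3", "pl"), imperfect("ktub", "3", "pl"))
```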
Morphological Ambiguity

Derivational ambiguity

- قاعدة: basis/principle/rule, military base, Qa'ida/Qaeda/Qaida

Inflectional ambiguity

- تكتب: you write, she writes

- Segmentation ambiguity

• وجد: he found; و+جد: and+grandfather

• للغة: ل+لغة: for a language; ل+اللغة: for the language

Spelling ambiguity

- Optional diacritics

• كاتب: /kātib/ writer, /kātab/ to correspond

- Suboptimal spelling

• Hamza dropping: أ, إ -> ا

• Undotted ta-marbuta: ة -> ه

• Undotted final ya: ي -> ى




Morphological Ambiguity 

• Multiple sources of ambiguity (one undiacritized spelling, بين)

- /bayyana/ Verb: he declared/demonstrated

- /bayyanna/ Verb: they [feminine] declared/demonstrated

- /bayyin/ Adj: clear/evident/explicit

- /bayna/ Prep: between/among

- /biyin/ Proper Noun: in Yen

- /biyn/ Proper Noun: Ben

• Hard to measure specific causes of ambiguity

- Derivational ambiguity* (diacritized tokens)

• 1.09 entries/token

• 1.01 entries/token (within same part-of-speech)

- Spelling ambiguity* (undiacritized tokens)

• 1.28 entries/token

• 1.08 entries/token (within same part-of-speech)

* in Buckwalter’s Lexicon (~40,000 lexemes)




Morphological Ambiguity 

• Average overall ambiguity* is 2.5 analyses/word

• Compare to English ENGTWOL ambiguity (1.7-2.2 analyses/word)

[Figure: histogram of the percentage of words (y-axis, 0% to 40%) by number of analyses per word (x-axis, 1 through 8 or more)]

* In Arabic Penn Treebank 1
Arabic Computational Morphology

• Representation units

• Natural token: وللمـكـتبـات

- White space separated strings (as is)

- Can include extra characters (e.g. tatweel/kashida)

• Word: وللمكتبات

• Segmented word

- Can include any degree of morphological analysis

- Pure segmentation: و + ل + لمكتبات

- Arabic Treebank tokens (with recovery of some deleted/modified letters): و + ل + المكتبات
Arabic Computational Morphology

• Representation units (continued)

• Prefix + Stem + Suffix

- ولل + مكتب + ات

- Can create more ambiguity

• Lexeme + Features

- مكتبة [+Plural +Def +ل +و]

• Root + Pattern + Features

- كتب + ma12a3a + [+Plural +Def +ل +و]

- Very abstract

• Root + Pattern + Vocalism + Features

- كتب + m123 + a.a.a + [+Plural +Def +ل +و]

- Very very abstract
Arabic Computational Morphology

Approaches

- Finite state machines (Beesley, 2001) (Kiraz, 2001) (Habash et al., 2005b)

- Concatenative analysis/generation (Buckwalter, 2002) (Cavalli-Sforza et al., 2000)

- Lexeme+Feature analysis/generation (Habash, 2004)

- Shallow stemming (Darwish, 2002) (Aljlayl and Frieder, 2002)

- Machine learning (Diab et al., 2004) (Lee et al., 2003) (Rogati et al., 2003) (Habash & Rambow, 2005a)

Issues 

- Appropriateness of system representation for an application 

• Machine Translation vs. Information Retrieval 

• Arabic spelling vs. phonetic spelling 

- System coverage 

- System extendibility 

- Availability to researchers 

- Use for analysis and generation 
Morphology and Syntax

• Rich morphology crosses into syntax

- Pro-drop / Subject conjugation

- Verb subcategorization and object clitics

• Verb(transitive)+subject+object

• Verb(intransitive)+subject, but not Verb(intransitive)+subject+object

• Verb(passive)+subject, but not Verb(passive)+subject+object

• Morphological interactions with syntax 

- Agreement 

• Full: e.g. Noun-Adjective on number, gender, and definiteness 

• Partial: e.g. Verb-Subject on gender (in VSO order) 

- Definiteness 

• Noun compound formation, copular sentences, etc. 

• Nouns+DefiniteArticle, Proper Nouns, Pronouns, etc. 
Morphology and Syntax

• Morphological interactions with syntax (continued)

- Case

• MSA is case marking: nominative, accusative, genitive

• Almost-free word order

• Case is often marked with optionally written short vowels

- This effectively limits the word-order freedom in published text

• Agglutination

- Attached prepositions create words that cross phrase boundaries

ل+المكتبات   li+Almaktabat

for the-libraries   [PP li [NP Almaktabat]]

• Some morphological analysis (minimally segmentation) is necessary even for statistical approaches to parsing
Sentence Structure



Two types of Arabic Sentences 

• Verbal sentences 

- [Verb Subject Object] (VSO) 

Wrote the-boys the-poems 
The boys wrote the poems 

• Copular sentences 

- [Topic Complement] 

the-boys poets 
The boys are poets 
Sentence Structure

Verbal sentences

- Verb agreement with gender only

• wrote(3MascSing) the-boy/the-boys

• wrote(3FemSing) the-girl/the-girls

- Pronominal subjects are conjugated

• wrote-you(MascSing)

• wrote-you(MascPlur)

• wrote-they(MascPlur)

- Passive verbs

• Same structure: Verb(passive) Subject(underlying object)

• Agreement with surface subject



Sentence Structure 



Verbal sentences 



- Common structural ambiguity

• Third masculine/feminine singular are structurally ambiguous

- Verb(3MascSingular) Noun(Masc)

  Verb subject=he object=Noun
  Verb subject=Noun

• Passive and active forms are often similar in standard orthography

- كتب /kataba/ he wrote

- كتب /kutiba/ it was written




Sentence Structure 



Copular sentences 

- [Topic Complement] 

Definite Topic, Indefinite Complement 

the-boy poet 
The boy is a poet 

- [Auxiliary Topic Complement]

Auxiliaries (kana and her sisters)

• Tense, Negation, Transformation, Persistence

• was the-boy poet: The boy was a poet

• is-not the-boy poet: The boy is not a poet

- Inverted order is expected in certain cases

• Indefinite topic

/ʕandi kitabun/ at-me a-book: I have a book



Sentence Structure 



• Copular sentences

- Types of complements

• Noun/Adjective/Adverb

- the-boy smart: The boy is smart

• Prepositional Phrase

- the-boy in the-library: The boy is in the library

• Copular-Sentence

- [the-boy [book-his big]]: The boy, his book is big

• Verb-Sentence

- [the-boys [wrote-they poems]]: The boys wrote the poems

- Full agreement in this order (SVO)

- [the-poems [wrote-it the boys]]: The poems, the boys wrote
Phrase Structure

Noun Phrase

- Determiner Noun Adjective PostModifier

this the-writer the-ambitious the-arriving from Japan
This ambitious writer from Japan

- Noun-Adjective agreement

• number, gender, definiteness

- the-writer(fem) the-ambitious(fem)

- the-writer(femPlur) the-ambitious(femPlur)




Phrase Structure 



• Noun Phrase

- Idafa construction

• Noun1 of Noun2 encoded structurally

• Noun1-indefinite Noun2-definite

king Jordan

the king of Jordan / Jordan’s king

- Noun1 becomes definite

• Agrees with definite adjectives

- Idafa chains

• N1 N2 ... Nn-1 Nn

indef indef ... indef def

son uncle neighbor chief committee management the-company

The cousin of the CEO’s neighbor
Phrase Structure

• Morphological definiteness interacts with syntactic structure

  Word 1 (writer)   Word 2 (artist)   Structure          Example          Translation
  definite          definite          Noun Phrase        الكاتب الفنان     The artist(ic) writer
  indefinite        definite          Noun Compound      كاتب الفنان       The writer of the artist
  definite          indefinite        Copular Sentence   الكاتب فنان       The writer is an artist
  indefinite        indefinite        Noun Phrase        كاتب فنان         An artist(ic) writer
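The four cells of the table amount to a decision on two definiteness values. The sketch below encodes that decision for two adjacent nominals in Buckwalter spelling. It assumes, as a deliberate simplification, that definiteness can be read off the article prefix `Al`; proper nouns and pronouns are definite without it, so a real system would use morphological analysis instead. The function names are illustrative.

```python
def is_definite(word_bw: str) -> bool:
    """Crude test: treat a word as definite when it carries the article Al."""
    return word_bw.startswith("Al")

def two_word_structure(word1: str, word2: str) -> str:
    """Classify a Word1 Word2 sequence by the definiteness of each word."""
    d1, d2 = is_definite(word1), is_definite(word2)
    if d1 and not d2:
        return "copular sentence"
    if not d1 and d2:
        return "noun compound"
    return "noun phrase"   # both definite or both indefinite

print(two_word_structure("AlkAtb", "AlfnAn"))   # the artist(ic) writer
print(two_word_structure("kAtb", "AlfnAn"))     # the writer of the artist
print(two_word_structure("AlkAtb", "fnAn"))     # the writer is an artist
print(two_word_structure("kAtb", "fnAn"))       # an artist(ic) writer
```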
Computational Resources



Monolingual corpora for building language models 

- Arabic Gigaword 

• Agence France Presse 

• AlHayat News Agency 

• AnNahar News Agency 

• Xinhua News Agency 

- Arabic Newswire 

- United Nations Corpus (parallel with other UN languages) 

- Ummah Corpus (parallel with English) 

Distributors 

- Linguistic Data Consortium (LDC) 

- Evaluations and Language resources Distribution Agency 
(ELDA) 




Computational Resources 



• Penn Arabic Treebank (PATB)

- Started in 2001

- Goal is 1 million words

- Currently 650K words

• Agence France Presse, AlHayat newspaper, AnNahar newspaper

• POS tags

- Buckwalter analyzer

- Arabic-tailored POS list

• PATB constituency representation

[Figure: example PATB constituency tree]

- Some modifications of Penn English Treebank

• (e.g. Verb-phrase internal subjects)
Computational Resources

• Prague Dependency Treebank

• Currently 100k words

• Partial overlap with PATB and Arabic Gigaword

- Agence France Presse, AlHayat and Xinhua

• Morphological analysis

- Similar to PATB

• Dependency representation

[Figure: example dependency tree with the following nodes]

  Pred   ya+SoEad   (he) gets on
  Sb     -huwa      he
  Obj    Al+bAS     the bus
  AuxY   wa-        and

Graphic courtesy of Otakar Smrz: http://ckl.mff.cuni.cz/padt/PADT 1.0/docs/slides/2003-eacl-trees.ppt



Computational Resources 



• Applications using Penn Arabic Treebank

- Statistical parsing

• Bikel’s parser (Bikel 2003)

- Same engine used with English, Chinese and Arabic

- POS tagging and morphological disambiguation

• (Diab et al, 2004) and (Habash and Rambow, 2005a)

• Arabic POS tagging (Khoja, 2001)

• Formalism conversion

- Constituency to dependency (Zabokrtsky and Smrz 2003)

- Tree-adjoining grammar extraction (Habash and Rambow 2004)

• Automatic diacritization
Morphology and Translation
Which level to go down to?

• Natural token: وللمـكـتبـات

• Word: وللمكتبات

• Segmented Word: و + ل + المكتبات

• Prefix + Stem + Suffix: ولل + مكتب + ات

• Lexeme + Features: مكتبة [+Plural +Def +ل +و]

• Root + Pattern + Features: كتب + ma12a3a + [+Plural +Def +ل +و]

Morphology and Translation 
What approach? 

• Natural token: not appropriate 

• Word: statistical MT 

• Segmented Word: statistical MT 

• Prefix + Stem + Suffix: statistical/symbolic 

• Lexeme + Features: symbolic MT 

• Root + Pattern + Features: too abstract? 

Morphology and Translation 

What resources? 

• Available resources may span different levels of representation! 

• Most dictionaries are lexeme-based 

• Buckwalter stem dictionary contains English glosses 

• Statistical translation lexicons depend on the type of tokenization used before alignment 

- Word (no disambiguation necessary) 

- Segmented word (minimal disambiguation necessary) 

- Stem/Lexeme (machine/human disambiguation necessary) 

• Consistency is important 
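
The dependence of a translation lexicon on tokenization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ANALYSES` table, `tokenize`, and `lexicon_keys` are hypothetical names, and the hand-written analyses reuse the MSA example `wa+lam taktubu+ha la+hu` from the dialect part of this tutorial. A real system would take its analyses from a morphological analyzer followed by disambiguation.

```python
# Hypothetical hand-written analyses: word -> list of (morpheme, kind).
# In practice these would come from an analyzer plus disambiguation.
ANALYSES = {
    "walam": [("wa", "proclitic"), ("lam", "stem")],
    "taktubuha": [("taktubu", "stem"), ("ha", "enclitic")],
    "lahu": [("la", "stem"), ("hu", "enclitic")],
}


def tokenize(sentence, scheme):
    """Tokenize a whitespace-separated sentence under one scheme.

    scheme="word":      keep every word whole (no analysis needed)
    scheme="segmented": split clitics off; "+" marks the attachment side
    """
    tokens = []
    for word in sentence.split():
        if scheme == "word" or word not in ANALYSES:
            tokens.append(word)
            continue
        for morph, kind in ANALYSES[word]:
            if kind == "proclitic":
                tokens.append(morph + "+")
            elif kind == "enclitic":
                tokens.append("+" + morph)
            else:
                tokens.append(morph)
    return tokens


def lexicon_keys(corpus, scheme):
    """Source-side vocabulary a translation lexicon would be keyed on."""
    return {tok for sent in corpus for tok in tokenize(sent, scheme)}


train = ["walam taktubuha lahu"]
word_keys = lexicon_keys(train, "word")
seg_keys = lexicon_keys(train, "segmented")

# Input tokenized differently from the training data misses the lexicon.
test_tokens = tokenize("walam taktubuha", "segmented")
misses = [tok for tok in test_tokens if tok not in word_keys]
print(misses)
```

Every segmented token misses the word-keyed lexicon here, which is the practical reason one scheme has to be applied to training data, lexicons, and test input alike.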

Translation Divergences 

• Beyond word-order variation 

- Arabic VSO - English SVO 

- Arabic N Adj - English Adj N 

• The meaning of two translationally equivalent constituents is distributed differently in the two languages 

• Divergence dimensions 

- Categorial Variation (develop -> development) 

- Conflation (become frozen -> freeze) 

- Inflation (freeze -> become frozen) 

- Structural (enter the room -> enter into the room) 

- Head Swap (swim across the river -> cross the river swimming) 

- Thematic (John likes Mary -> Mary pleases John) 

Translation Divergences 

conflation 

I have a book 

at-me book 

Translation Divergences 

conflation 

I am not here 

I-am-not here 

Translation Divergences 

structural 

book Nizar 

Nizar’s book 
Book of Nizar 

Translation Divergences 

structural 

I found the book 

found-I upon the-book 

Translation Divergences 

thematic & conflational 

head-my hurts-me 

My head hurts 
I have a headache 

Translation Divergences 

head swap and categorial 




I swam across the river quickly 

I-sped crossing the-river swimming 

Computational Resources 

Dictionaries 

- Buckwalter stem dictionary (LDC) 

- Salmone dictionary (Tufts University) 

- Online dictionaries: Ajeeb.com (Sakhr), Almisbar.com, Ectaco.com 

Parallel corpora (LDC) 

- United Nations Corpus (parallel with other UN languages) 

- Ummah Corpus (parallel with English) 

- Arabic News Translation Corpus 

- Arabic Treebank English Translation 

- More on LDC webpage . . . 

MT evaluation 

- Arabic-English Multi-translation Corpus (LDC) 

- NIST’s MT-EVAL 

• Statistical MT systems are the state of the art 

lam jaʃtari nizar tawilatan ʒadīdatan 

didn’t buy Nizar table new 

nizar maʃtaraʃ tarabeza gidīda 
nizar maʃtaraʃ tawile ʒdīde 
nizar maʃraʃ mida ʒdīda 

Nizar not-bought-not table new 

General Definitions 



What is a ‘dialect’? 

- Political and Religious factors 

Modern Standard Arabic 

Regional Dialects 

- Egyptian Arabic (EGY) 

- Levantine Arabic (LEV) 

- Gulf Arabic (GULF) 

- North African Arabic (NOR) 

- Iraqi, Yemenite, Sudanese, Maltese? 

Social dialects 

- City 

- Peasant 

- Bedouin 




General Definitions 



• Diglossia 

• Badawi’s levels 

- Traditional Arabic 

- Modern Arabic 

- Educated Colloquial 

- Literate Colloquial 

- Illiterate Colloquial 

• Polyglossia 

[Figure: diagram with regions labeled Classical, Dialect, and Foreign] 

Phonological Variation 

[Figure: sound inventories of MSA and LEV] 

• No dialect-specific standard orthography 



Lexical Variation 

• Arabic Dialects vary widely lexically 

English       MSA         Moroccan    Egyptian    Syrian    Iraqi 
table         Tawila      mida        Tarabeza    Tawle     mez 
cat           qiTTa       qeTTa       ’oTTa       bisse     bazzuna 
of            idafa       dyal        bita3       taba3     mal 
(I) want      ’uridu      bgit        3awez       biddi     ’arid 
there is      yujadu      kayn        fi          fi        aku 
there isn't   la yujadu   ma kaynʃ    mafiʃ       ma fi     maku 

• Arabic orthography allows consolidating some variations 

Morphological Variation 

Nouns 

- No case marking 

• Word order implications 

- Paradigm reduction 

• Consolidating masculine & feminine plural 

Verbs 

- Paradigm reduction 

• Loss of dual forms 

• Consolidating masculine & feminine plural (2nd, 3rd person) 

• Loss of morphological moods 

- Subjunctive/jussive form dominates in some dialects 

- Indicative form dominates in others 

- Other aspects increase in complexity 




Morphological Variation 

Verb Morphology 




MSA 

ولم تكتبها له 
walam taktubuha lahu 
wa+lam taktubu+ha la+hu 
and+not_past write_you+it for+him 

EGY 

wimakatabtuhaluʃ 
wi+ma+katab+tu+ha+lu+ʃ 
and+not+wrote+you+it+for_him+not 

And you didn’t write it for him 
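
The contrast between the two analyses can be made concrete by pairing each morpheme with its gloss. The sketch below is a minimal illustration: `align_gloss` is a hypothetical helper, and the EGY segmentation assumes the last two morphemes read `lu` and `ʃ`, which is an interpretation of the slide rather than something it states unambiguously.

```python
def align_gloss(segmented, gloss):
    """Pair each '+'-separated morpheme with its gloss, word by word."""
    pairs = []
    for word, word_gloss in zip(segmented.split(), gloss.split()):
        morphs = word.split("+")
        glosses = word_gloss.split("+")
        if len(morphs) != len(glosses):
            raise ValueError(f"morpheme/gloss mismatch in {word!r}")
        pairs.append(list(zip(morphs, glosses)))
    return pairs


msa = align_gloss("wa+lam taktubu+ha la+hu",
                  "and+not_past write_you+it for+him")

# Assumption: the final two EGY morphemes are read as "lu" and "ʃ".
egy = align_gloss("wi+ma+katab+tu+ha+lu+ʃ",
                  "and+not+wrote+you+it+for_him+not")

print(len(msa), "MSA words;", len(egy), "EGY word")
# Negation is a single particle in MSA but wraps the verb in EGY.
print([m for m, g in egy[0] if g == "not"])
```

The MSA clause spreads its morphemes over three words, while the EGY form packs seven morphemes into one word, with negation marked both before and after the verb.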

Morphological Variation 

Verb conjugation 

• Perfect verb derivation (suffixes only) 

       1st Person Singular   2nd Person Singular (masc.)   2nd Person Singular (fem.) 
MSA    katabtu               katabta                       katabti 
LEV    katabt                katabt                        katabti 

• Imperfect verb derivation (prefix+suffix) 

       1st Person Singular   2nd Person Singular (masc.)   2nd Person Singular (fem.) 
MSA    aktubu                taktubu                       taktubīna / taktubī 
LEV    aktob                 toktob                        toktobi 
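
Paradigm reduction in the perfect can be checked mechanically by generating the forms and looking for cells that share a surface form. This is a small sketch, not a morphological generator: the names `PERFECT_SUFFIXES`, `conjugate_perfect`, and `syncretic_cells` are hypothetical, and the LEV 2nd person masculine suffix `t` is an assumption read off the single LEV form `katabt` that the table gives for the first two columns.

```python
# Perfect-tense suffixes taken from the table; only three cells are covered.
# Assumption: LEV uses the same suffix "t" for 1sg and 2sg masculine.
PERFECT_SUFFIXES = {
    "MSA": {"1sg": "tu", "2sg.m": "ta", "2sg.f": "ti"},
    "LEV": {"1sg": "t", "2sg.m": "t", "2sg.f": "ti"},
}


def conjugate_perfect(stem, variety):
    """Attach each perfect suffix of the variety to the stem."""
    return {cell: stem + suffix
            for cell, suffix in PERFECT_SUFFIXES[variety].items()}


def syncretic_cells(paradigm):
    """Return the forms shared by more than one paradigm cell."""
    by_form = {}
    for cell, form in paradigm.items():
        by_form.setdefault(form, []).append(cell)
    return {form: cells for form, cells in by_form.items() if len(cells) > 1}


msa = conjugate_perfect("katab", "MSA")
lev = conjugate_perfect("katab", "LEV")

print(syncretic_cells(msa))
print(syncretic_cells(lev))
```

MSA keeps the three cells distinct, whereas in LEV the loss of final short vowels leaves one form for two cells, so a dialect analyzer has to return both readings for `katabt`.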



Morphological Variation 

Tense expression 

MSA 

- Perfect: kataba (Past) 

- Imperfect: jaktubu (Present), sajaktubu (Future) 

LEV 

- Perfect: katab (Past) 

- Imperfect: jiktob (0-Tense), bjoktob (Present habitual), 9am bjoktob (Present progressive), bajiktob (Future) 

Syntactic Variation 

• Verbal sentences 

- The children wrote poems 

- MSA 

• Verb Subject Object (Partial agreement) 

wrote-masc the-boys the-poems 

• Subject Verb Object (Full agreement) 

the-boys wrote-mascPlural the-poems 

- LEV, EGY 

• Subject Verb Object 

the-boys wrote-mascPlural the-poems 

• Less common: Verb Subject Object 

wrote-mascPlural the-boys the-poems 

• Full agreement in both orders 
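
The agreement pattern above reduces to one rule with a single exception, which a short sketch makes explicit. The function names `verb_agreement` and `gloss_verb` are hypothetical, and only gender and number are modeled because those are the features the glosses show.

```python
def verb_agreement(variety, order):
    """Return the subject features the verb agrees with.

    variety: "MSA", "LEV" or "EGY"; order: "VSO" or "SVO".
    """
    if variety not in ("MSA", "LEV", "EGY"):
        raise ValueError(f"unknown variety: {variety!r}")
    if order not in ("VSO", "SVO"):
        raise ValueError(f"unknown word order: {order!r}")
    if variety == "MSA" and order == "VSO":
        return ("gender",)           # partial agreement
    return ("gender", "number")      # full agreement


def gloss_verb(variety, order, gender="masc", number="Plural"):
    """Build a gloss such as 'wrote-mascPlural' for a plural subject."""
    features = verb_agreement(variety, order)
    tag = gender if "gender" in features else ""
    if "number" in features:
        tag += number
    return "wrote-" + tag


for variety, order in [("MSA", "VSO"), ("MSA", "SVO"),
                       ("LEV", "SVO"), ("EGY", "VSO")]:
    print(variety, order, gloss_verb(variety, order))
```

Only the MSA verb-initial order drops number agreement, so a system that maps dialect text to MSA word order, or the reverse, has to adjust the verb form as well as the order.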

Syntactic Variation 

Noun Phrase 

- Idafa construction 

• Noun1 of Noun2 encoded structurally 

king Jordan 

the king of Jordan / Jordan’s king 

- Dialects have an additional common construct 

• Noun1 <particle> Noun2 

• LEV: the-king belonging-to Jordan 

• <particle> differs widely among dialects 

- Pre/post-modifying demonstrative article 

• MSA: this the-man (this man) 

• EGY: the-man this (this man) 
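
The two possessive constructions can be sketched as gloss templates. This is an illustration only: `POSSESSIVE_PARTICLE` and `possessive_gloss` are hypothetical names, the particles come from the "of" row of the lexical variation table earlier in this part (with the Moroccan form read as `dyal`, which is an assumption), and Syrian stands in for LEV.

```python
# Possessive particles per dialect, from the "of" row of the lexical table.
# Assumption: the Moroccan particle is "dyal".
POSSESSIVE_PARTICLE = {
    "Moroccan": "dyal",
    "Egyptian": "bita3",
    "Syrian": "taba3",
    "Iraqi": "mal",
}


def possessive_gloss(head, possessor, variety="MSA", analytic=False):
    """Gloss 'head of possessor' as idafa or as a particle construction.

    Idafa juxtaposes the two nouns; the analytic construction, available
    in the dialects only, marks the head as definite and adds a particle.
    """
    if not analytic:
        return f"{head} {possessor}"
    if variety not in POSSESSIVE_PARTICLE:
        raise ValueError(f"no possessive particle listed for {variety!r}")
    return f"the-{head} {POSSESSIVE_PARTICLE[variety]} {possessor}"


print(possessive_gloss("king", "Jordan"))
print(possessive_gloss("king", "Jordan", "Syrian", analytic=True))
print(possessive_gloss("king", "Jordan", "Iraqi", analytic=True))
```

Because the particle changes from dialect to dialect while the idafa template stays constant, a dialect-aware parser or MT system needs the particle list as lexical knowledge rather than as a structural rule.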

Code Switching 

MSA and Dialect mixing in speech 

• phonology, morphology and syntax 

[Figure: Arabic transcript excerpt with MSA and LEV passages marked] 

Aljazeera Transcript http://www.aljazeera.net/programs/op_direction/articles/2004/7/7-23-1.htm 

Computational Resources 

• Most work on Arabic dialects focuses on Automatic Speech Recognition 

• Speech/transcript corpora 

- Egyptian and Levantine Arabic (LDC) 

- Moroccan and Tunisian Arabic (ELDA) 

- Gulf Arabic (Appen) 

- Many others... 

• Few lexicons/morphology resources 

- CallHome Egyptian Arabic monolingual lexicon (LDC) 

- CallHome Egyptian Verb transducer (LDC) 

• Work on multi-dialectal resources 

- Linguistic Data Consortium 

- Columbia University Arabic Dialect Project 

• Pan-Arab lexicon and Pan-Arab Morphology 

• Parsing Arabic Dialects (JHU summer workshop 2005) 




Resources 



Distributors 

• Linguistic Data Consortium 

• NEMLAR (Network for Euro-Mediterranean LAnguage Resources) 

• ELSNET (European Network of Excellence in Human Language Technologies) 

• ELDA (Evaluations and Language resources Distribution Agency) 

Resources 

Reports 

• Mohamed Maamouri and Christopher Cieri. 2002. 
Resources for Natural Language Processing at the 
Linguistic Data Consortium. In Proceedings of the 
International Symposium on Processing of Arabic, pages 
125-146, Manouba, Tunisia, April 2002. 

• Mahtab Nikkhou and Khalid Choukri. Survey on Arabic 
Language Resources and Tools in the Mediterranean 
Countries. 

• Arabic Information Retrieval and Computational 
Linguistics Resources (thanks to Doug Oard) 

Resources 

Monolingual Corpora 

• Arabic Gigaword 

• Arabic Newswire 

Parallel Corpora 

• United Nations Parallel Corpus 

• Ummah Parallel Corpus 

• Arabic News Translation 

• Multiple-Translation Arabic 

Treebanks 

• Arabic Penn Treebank Webpage 

- Part 1 v2.0, Part 2 v2.0, Part 3 v1.0, 10K-word English Translation 

• Prague Arabic Dependency Treebank 

Resources 

Morphology 

• Buckwalter Arabic Morphological Analyzer 

- Version 1.0, Version 2.0 

• Xerox Arabic Morphology (online) 

Dialect Resources 

• CALLHOME Egyptian Arabic Transcripts 

• CALLHOME Egyptian Arabic Speech 

• Egyptian Colloquial Arabic Lexicon 

• Levantine Arabic Resources 

• http://www.orientel.org/ 

• http://www.appen.com.au 

Resources 

Dictionaries 

• Buckwalter Stem Dictionary 

• H. Anthony Salmone. An Advanced Learner's Arabic-English 
Dictionary, encoded by the Perseus Project, Tufts 
University (contact: David Smith dasmith@perseus.tufts.edu) 

• Ajeeb Arabic-English Dictionary (online) 

• Al-Misbar Dictionary (online) 

• Ectaco Bilingual Dictionary (online) 

Online MT systems 

• Ajeeb's Arabic-English Machine Translation (online) 

• Al-Misbar English-Arabic Machine Translation (online) 

Conferences and Workshops 

with some focus on Arabic 



• ACL 2005 Workshop on Computational Approaches to Semitic Languages 

• Arabic Language Resources and Tools Conference 2004 Cairo, Egypt 

• Workshop on Computational Approaches to Arabic Script-based Languages (COLING 2004) 

• Traitement Automatique du Langage Naturel (TALN '04) 

• NIST MT EVAL ( http://www.nist.gov/speech/tests/mt/ ) 

• MT Summit IX Workshop on Machine Translation for Semitic Languages in 2003 

• LREC 2002 Arabic Language Resources and Evaluation Workshop 

• ACL 2002 Workshop on Computational Approaches to Semitic Languages 

• International Symposium on Processing of Arabic 2002, Tunisia 

• Workshop on ARABIC Language Processing: Status and Prospects 
(ACL/EACL 2001) 

• Arabic Translation and Localisation Symposium (ATLAS 1999) 

• Computational Approaches to Semitic Languages (COLING/ACL 1998) 